In this tutorial I will provide a basic overview of differential expression analysis for transcriptional profiling using RNA-Seq data. We will be using the DESeq2 library in R. This approach utilizes a variant on the assumption of a negative binomially set of counts. This approach assumes that all you have going in are counts, that have not been normalized either for library size (or number of mapped reads), not for transcript length.
This is a small part of the data for a project examining how both variation in sex and size (diet manipulations) influences transcriptional profiles for exaggerated traits (hyper-allometric) vs traits with normal allometry. This is from one particular species of rhino beetle where for the full study we extracted RNA from the developing tissues for the head and thoracic horns, wing and genitals.
Here is an example of what these critters looks like from Wikipedia
While it is not particularly relevant to this tutorial, the reads were mapped to a de novo transcriptome
Please note: eXpress does not sort based on the transcript, so pre-sort either in shell or R. Also eXpress files are tab delimited.
R loaded, and let’s get started.First we load in libraries in R.
library(DESeq2)
## Loading required package: S4Vectors
## Warning: package 'S4Vectors' was built under R version 3.2.2
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
##
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
##
## The following object is masked from 'package:stats':
##
## xtabs
##
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, as.vector, cbind,
## colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
## intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unlist, unsplit
##
## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'
## Loading required package: IRanges
## Loading required package: GenomicRanges
## Warning: package 'GenomicRanges' was built under R version 3.2.2
## Loading required package: GenomeInfoDb
## Loading required package: Rcpp
## Warning: package 'Rcpp' was built under R version 3.2.2
## Loading required package: RcppArmadillo
## Warning: package 'RcppArmadillo' was built under R version 3.2.2
library("RColorBrewer")
library("gplots")
##
## Attaching package: 'gplots'
##
## The following object is masked from 'package:IRanges':
##
## space
##
## The following object is masked from 'package:stats':
##
## lowess
Set the working directory for the raw count data (you need to know where you put it). I will go over how I organize my projects to keep this simple
#setwd("../data")
DESeq2 and other libraries often have helper functions for getting your count data in. In particular if you are using objects created from other tools that the same authors generated. However, if you are going to make your own pipeline, it is important to know how to write some simple R code to be able to get your data in, and to parse it so that it is in the format you need.
Let’s start by looking at one of the files that eXpress generates. As you will see, for each sample we have a whole text file with multiple columns. For this tutorial we only have 16 samples we have to worry about, but even so trying to copy and paste columns would be a horrible idea.
Instead we will read in all of the files and extract what we want.
This just makes an object storing all of the file names we need. We only need the files that end in _results.xprs (eXpress generates log files with similar names which we are not interested in).
in_dir = dir(, pattern="_results.xprs")
Let’s take a look at in_dir in the R console.
When loading large data sets into R (something R is not great at), there are a few tricks (including using libraries that help like data.table library). The easiest and most helpful thing to speed it up is to specify how many rows the data will have.
So How many genes are there? We ask R to issue a system (i.e. Unix) command for the first file in the in_dir object.
system(paste("wc -l ",in_dir[[1]]))
Now we import all of the data as a list of data frames.
counts_in <- lapply(in_dir, function(x) read.table(x, header=T, sep = "", nrows=5077))
One annoying aspect of using eXpress is that it does not sort the data. Let’s take a look at this.
head(counts_in[[1]])
## bundle_id target_id length
## 1 1 c13073_g5_i1&42..743@gi|642915734|ref|XP_008190781.1| 766
## 2 1 c13073_g5_i2&205..906@gi|642915738|ref|XP_008190784.1| 929
## 3 2 c8862_g1_i4&165..938@gi|642913267|ref|XP_008201463.1| 1392
## 4 3 c13004_g1_i1&632..2767@gi|642916547|ref|XP_008191645.1| 5549
## 5 4 c13531_g1_i4&299..2704@gi|642928531|ref|XP_008195362.1| 4052
## 6 5 c6406_g2_i1&282..794@gi|642934845|ref|XP_008197834.1| 1143
## eff_length tot_counts uniq_counts est_counts eff_counts
## 1 566.4421 50 0 0.000188 0.000255
## 2 677.8334 78 28 77.999812 106.902121
## 3 877.3345 475 475 475.000000 753.646435
## 4 7179.3722 2755 2755 2755.000000 2129.363757
## 5 3353.3037 187 187 187.000000 225.963426
## 6 685.5630 1522 1522 1522.000000 2537.543547
## ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low
## 1 6.928172e-05 1.838578e+01 3.688514e-05 0.000000
## 2 2.152981e+00 8.112911e-06 1.276064e+01 12.720980
## 3 0.000000e+00 0.000000e+00 6.003858e+01 60.038580
## 4 0.000000e+00 0.000000e+00 4.255368e+01 42.553680
## 5 0.000000e+00 0.000000e+00 6.184018e+00 6.184018
## 6 0.000000e+00 0.000000e+00 2.461894e+02 246.189400
## fpkm_conf_high solvable tpm
## 1 0.01078445 TRUE 4.028385e-05
## 2 12.80031000 TRUE 1.393645e+01
## 3 60.03858000 TRUE 6.557072e+01
## 4 42.55368000 TRUE 4.647471e+01
## 5 6.18401800 TRUE 6.753832e+00
## 6 246.18940000 TRUE 2.688740e+02
head(counts_in[[2]])
## bundle_id target_id length
## 1 1 c7241_g2_i1&199..1293@gi|91080531|ref|XP_966681.1| 1665
## 2 2 c13936_g3_i3&491..748@gi|642939413|ref|XP_008193327.1| 798
## 3 2 c13936_g3_i2&187..447@gi|91094671|ref|XP_972558.1| 736
## 4 3 c9333_g2_i2&48..788@gi|91090952|ref|XP_974552.1| 938
## 5 4 c11973_g1_i4&100..2328@gi|642936719|ref|XP_008198552.1| 2385
## 6 5 c3538_g1_i2&316..873@gi|91089463|ref|XP_968466.1| 925
## eff_length tot_counts uniq_counts est_counts eff_counts
## 1 1723.2887 288 288 288 278.25865
## 2 520.5871 92 53 53 81.24289
## 3 436.5583 130 91 130 219.16891
## 4 740.9067 368 368 368 465.89403
## 5 2228.3498 456 456 456 488.05623
## 6 503.4520 550 550 550 1010.52328
## ambig_distr_alpha ambig_distr_beta fpkm fpkm_conf_low
## 1 0.000000e+00 0.000000 16.72554 16.725540
## 2 2.000001e-06 1.999999 10.18892 8.291442
## 3 2.999997e+00 0.000003 29.80209 23.762440
## 4 0.000000e+00 0.000000 49.70842 49.708420
## 5 0.000000e+00 0.000000 20.47987 20.479870
## 6 0.000000e+00 0.000000 109.33270 109.332700
## fpkm_conf_high solvable tpm
## 1 16.72554 TRUE 19.28159
## 2 12.08640 TRUE 11.74602
## 3 35.84175 TRUE 34.35655
## 4 49.70842 TRUE 57.30503
## 5 20.47987 TRUE 23.60967
## 6 109.33270 TRUE 126.04130
So we go ahead and sort each of the data files based on gene name (the second column of each file).
counts_sorted <- lapply(counts_in, function(x) x <- x[order(x[,2]),])
#make matrix of counts
Note I am using total counts, not unique counts for this analysis. Since there were few multi-mapped reads based on the reference transcriptome I used, it seemed the most sensible option. However, this does make sense for most transcriptomes, so beware!!!! We will discuss this issue in detail next week and the costs and benefits of different approaches to dealing with multi-mapped reads.
tot_count_matrix <- matrix(unlist(lapply(counts_sorted, function(x) x$tot_counts)),
nrow=5076, ncol=16)
Let’s take a peak at this file
head(tot_count_matrix)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] 441 441 459 252 381 294 528 579 532 377 544 464 290
## [2,] 32 51 26 31 43 9 50 39 41 14 27 69 15
## [3,] 1917 2660 4777 546 4523 32 2887 1542 3518 8273 2518 2358 5356
## [4,] 61 77 45 45 48 45 104 82 77 48 73 81 78
## [5,] 7 7 17 16 12 31 13 35 34 4 31 1 9
## [6,] 22 15 12 4 26 225 610 148 10 5 8 22 7
## [,14] [,15] [,16]
## [1,] 546 282 465
## [2,] 21 25 23
## [3,] 7039 2021 9060
## [4,] 47 51 96
## [5,] 2 3 12
## [6,] 6 51 6
Now we need to set up the R object for the experimental design,to go along with the counts.
First we start with some data parsing.
I always name my files with an underscore as delimiter to make life easy when I need to break them up for groups. I will know use the strsplit() to split the file names based on sex, size and tissue (which we will not use for this tutorial).
parse_names <- strsplit(in_dir, split="_")
Let’s take a look at the object
parse_names
## [[1]]
## [1] "F101" "lg" "female" "hdhorn"
## [5] "CGATGT" "L008" "results.xprs"
##
## [[2]]
## [1] "F105" "lg" "female" "hdhorn"
## [5] "GCCAAT" "L006" "results.xprs"
##
## [[3]]
## [1] "F131" "lg" "female" "hdhorn"
## [5] "CGATGT" "L001" "results.xprs"
##
## [[4]]
## [1] "F135" "sm" "female" "hdhorn"
## [5] "GTTTCG" "L003" "results.xprs"
##
## [[5]]
## [1] "F136" "sm" "female" "hdhorn"
## [5] "GTGAAA" "L007" "results.xprs"
##
## [[6]]
## [1] "F196" "sm" "female" "hdhorn"
## [5] "AGTTCC" "L002" "results.xprs"
##
## [[7]]
## [1] "F197" "sm" "female" "hdhorn"
## [5] "CAGATC" "L004" "results.xprs"
##
## [[8]]
## [1] "F218" "lg" "female" "hdhorn"
## [5] "CGATGT" "L005" "results.xprs"
##
## [[9]]
## [1] "M120" "sm" "male" "hdhorn"
## [5] "ATGTCA" "L004" "results.xprs"
##
## [[10]]
## [1] "M125" "lg" "male" "hdhorn"
## [5] "TGACCA" "L006" "results.xprs"
##
## [[11]]
## [1] "M160" "lg" "male" "hdhorn"
## [5] "TTAGGC" "L005" "results.xprs"
##
## [[12]]
## [1] "M171" "sm" "male" "hdhorn"
## [5] "GTTTCG" "L007" "results.xprs"
##
## [[13]]
## [1] "M172" "sm" "male" "hdhorn"
## [5] "ATCACG" "L003" "results.xprs"
##
## [[14]]
## [1] "M180" "lg" "male" "hdhorn"
## [5] "CAGATC" "L001" "results.xprs"
##
## [[15]]
## [1] "M200" "sm" "male" "hdhorn"
## [5] "ACAGTG" "L008" "results.xprs"
##
## [[16]]
## [1] "M257" "lg" "male" "hdhorn"
## [5] "ATGTCA" "L002" "results.xprs"
Now we make a matrix of the names of treatment groups (for our experimental design matrix).
parse_names <- matrix(unlist(parse_names), nrow=16, ncol=7, byrow=T)
Since we need sample names as column names, I reconstructed a simple version of the unique names
col_names_counts <- paste(parse_names[,1], "_", parse_names[,2], "_", parse_names[,3], "_", parse_names[,4], sep="")
Finally I add column names to the counts
colnames(tot_count_matrix) = col_names_counts # sample names as column names
# add contig names as row names
rownames(tot_count_matrix) = counts_sorted[[1]]$target_id
We check to make sure the matrix size makes sense.
dim(tot_count_matrix)
## [1] 5076 16
Now we set up the data frame for the experimental design. All of the info we need is in the parse_names variable.Let’s take a look at parse_names again.
head(parse_names)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] "F101" "lg" "female" "hdhorn" "CGATGT" "L008" "results.xprs"
## [2,] "F105" "lg" "female" "hdhorn" "GCCAAT" "L006" "results.xprs"
## [3,] "F131" "lg" "female" "hdhorn" "CGATGT" "L001" "results.xprs"
## [4,] "F135" "sm" "female" "hdhorn" "GTTTCG" "L003" "results.xprs"
## [5,] "F136" "sm" "female" "hdhorn" "GTGAAA" "L007" "results.xprs"
## [6,] "F196" "sm" "female" "hdhorn" "AGTTCC" "L002" "results.xprs"
experimental_design = data.frame(
sample_names = col_names_counts, # sample name
individual = factor(parse_names[,1]), # each individual beetle
size = factor(parse_names[,2]), # small or large beetle
sex = factor(parse_names[,3]), # male or females
#tissue = factor(parse_names[,4]), # tissue type, we don't need it.
lane = factor(parse_names[,6]) # Whick lane on the Illumina flowcell.
)
Before we begin any real analysis. It pays to take some looks at the data. I am not going to go through a full exploratory data analysis session here. But some obvious plots
It is well known that there can be substantial lane to lane variation. For this experiment, it was designed so that 8 samples were run in each lane (barcoded), in a complete block randomized design. This enables us to control for lane effects if necessary. So let’s take a look.
First we create a DESeq data object using our counts, experimental design and a simple statistical model (more on this later)
test_lane_effects <- DESeqDataSetFromMatrix(tot_count_matrix, experimental_design,
design = formula(~ lane))
test_lane_effects2 <- DESeq(test_lane_effects) # We know fit the simple model
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
This generates a fairly complex object
#str(test_lane_effects2)
For the moment we can ask whether any genes show evidence of different expression based solely on lane to lane variation.
test_lane_effects2_results <- results(test_lane_effects2)
summary(test_lane_effects2_results) # No evidence, but this is a bit incomplete
##
## out of 5076 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 0, 0%
## LFC < 0 (down) : 0, 0%
## outliers [1] : 0, 0%
## low counts [2] : 0, 0%
## (mean count < 0.1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
We can also plot the mean-dispersion relationship for this data.
plotDispEsts(test_lane_effects2)
Let’s talk about what this means.
We can also use some multivariate approaches to look at variation. For PCA (checking it with a “blind” dispersion estimate to look for any funky effects. Not for biological inference).
for_pca <- rlog(test_lane_effects2, blind=TRUE)
rlog is one approach to asjusting for both library size and dispersopm among samples. blind=TRUE, has it ignore information from the model (in this case lane).
plotPCA(for_pca, intgroup=c("lane")) # no obvious lane effects.
and now for sex and size
plotPCA(for_pca, intgroup=c("sex", "size"))
For distance matrix for clustering QC
rlogMat <- assay(for_pca) # just making a matrix of the counts that have been corrected for over-dispersion in a "blind" fashion
distsRL <- dist(t(rlogMat)) # Computes a distance matrix (Euclidian Distance)
mat <- as.matrix(distsRL) # Make sure it is a matrix
We need to rename our new matrix of distances based on the samples.
rownames(mat) <- colnames(mat) <- with(colData(test_lane_effects2), paste(sex, size, lane, sep=" : "))
hc <- hclust(distsRL) # performs hierarchical clustering
hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100) # picking our colours
Now we generate the plot
heatmap.2(mat, Rowv=as.dendrogram(hc),
symm=TRUE, trace="none",
col = rev(hmcol), margin=c(13, 13))
While I checked that there was no evidence of lane effects (see the Likelihood Ratio Test below), I am keeping it in the model as it seems to have little effects. However, it is using up “df” (in the parameter estimation sense), so it may be worth ultimately getting rid of it.
Given the results from above, I am removing lane entirely.
DESeq_data <- DESeqDataSetFromMatrix(tot_count_matrix, experimental_design,
design = formula(~ sex + size + sex:size))
DESeq_data <- DESeq(DESeq_data)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- standard model matrices are used for factors with two levels and an interaction,
## where the main effects are for the reference level of other factors.
## see the 'Interactions' section of the vignette for more details: vignette('DESeq2')
While everything is stored, by default DESeq2 is evaluating the final term in the model. In this case evidence of interactions between sex and size. We can look at thse
plotDispEsts(DESeq_data)
plotMA(DESeq_data, ylim=c(-2,2))
Let’s look at the results
res1 <- results(DESeq_data, pAdjustMethod="BH") # How do we deal with correcting for multiple comparisons.
head(res1)
## log2 fold change (MAP): sexmale.sizesm
## Wald test p-value: sexmale.sizesm
## DataFrame with 6 rows and 6 columns
## baseMean
## <numeric>
## c1_g1_i1&514..1527@gi|91093034|ref|XP_970412.1| 422.06820
## c100_g1_i1&108..818@gi|91091192|ref|XP_972192.1| 32.42903
## c10000_g1_i2&1070..1519@gi|189240491|ref|XP_001810703.1| 3697.37582
## c10003_g2_i1&57..953@gi|189237943|ref|XP_001811459.1| 65.73021
## c10004_g1_i5&77..784@gi|642924690|ref|XP_008194399.1| 14.23623
## c10006_g1_i1&290..1591@gi|642920893|ref|XP_973626.3| 61.02054
## log2FoldChange
## <numeric>
## c1_g1_i1&514..1527@gi|91093034|ref|XP_970412.1| 0.149358222
## c100_g1_i1&108..818@gi|91091192|ref|XP_972192.1| 0.446694320
## c10000_g1_i2&1070..1519@gi|189240491|ref|XP_001810703.1| -0.064080820
## c10003_g2_i1&57..953@gi|189237943|ref|XP_001811459.1| 0.276591281
## c10004_g1_i5&77..784@gi|642924690|ref|XP_008194399.1| -0.073799880
## c10006_g1_i1&290..1591@gi|642920893|ref|XP_973626.3| -0.005066566
## lfcSE
## <numeric>
## c1_g1_i1&514..1527@gi|91093034|ref|XP_970412.1| 0.2070437
## c100_g1_i1&108..818@gi|91091192|ref|XP_972192.1| 0.2840509
## c10000_g1_i2&1070..1519@gi|189240491|ref|XP_001810703.1| 0.2287034
## c10003_g2_i1&57..953@gi|189237943|ref|XP_001811459.1| 0.2651469
## c10004_g1_i5&77..784@gi|642924690|ref|XP_008194399.1| 0.2190148
## c10006_g1_i1&290..1591@gi|642920893|ref|XP_973626.3| 0.2041956
## stat
## <numeric>
## c1_g1_i1&514..1527@gi|91093034|ref|XP_970412.1| 0.72138489
## c100_g1_i1&108..818@gi|91091192|ref|XP_972192.1| 1.57258528
## c10000_g1_i2&1070..1519@gi|189240491|ref|XP_001810703.1| -0.28019184
## c10003_g2_i1&57..953@gi|189237943|ref|XP_001811459.1| 1.04316248
## c10004_g1_i5&77..784@gi|642924690|ref|XP_008194399.1| -0.33696296
## c10006_g1_i1&290..1591@gi|642920893|ref|XP_973626.3| -0.02481232
## pvalue
## <numeric>
## c1_g1_i1&514..1527@gi|91093034|ref|XP_970412.1| 0.4706727
## c100_g1_i1&108..818@gi|91091192|ref|XP_972192.1| 0.1158149
## c10000_g1_i2&1070..1519@gi|189240491|ref|XP_001810703.1| 0.7793303
## c10003_g2_i1&57..953@gi|189237943|ref|XP_001811459.1| 0.2968730
## c10004_g1_i5&77..784@gi|642924690|ref|XP_008194399.1| 0.7361448
## c10006_g1_i1&290..1591@gi|642920893|ref|XP_973626.3| 0.9802047
## padj
## <numeric>
## c1_g1_i1&514..1527@gi|91093034|ref|XP_970412.1| 0.9999802
## c100_g1_i1&108..818@gi|91091192|ref|XP_972192.1| 0.9999802
## c10000_g1_i2&1070..1519@gi|189240491|ref|XP_001810703.1| 0.9999802
## c10003_g2_i1&57..953@gi|189237943|ref|XP_001811459.1| 0.9999802
## c10004_g1_i5&77..784@gi|642924690|ref|XP_008194399.1| 0.9999802
## c10006_g1_i1&290..1591@gi|642920893|ref|XP_973626.3| 0.9999802
summary(res1) # only one gene which seems to be showing evidence of significant differences between size and sex
##
## out of 5076 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 0, 0%
## LFC < 0 (down) : 1, 0.02%
## outliers [1] : 37, 0.73%
## low counts [2] : 0, 0%
## (mean count < 0.1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
resultsNames(DESeq_data)
## [1] "Intercept" "sex_male_vs_female" "size_sm_vs_lg"
## [4] "sexmale.sizesm"
Just to keep in mind. We expect a priori, with no true “signifcant” hits an approximately uniform distribution that looks like
p_fake <- rbeta(5076, 1,1) # you could also use runif(5076,1,1)
hist(p_fake)
But we actually observe
hist(res1$pvalue)
FDR methods exploit this.
We can now look at specific planned contrasts (in this case male VS female).
res_contrast_sex_hdhorn <- results(DESeq_data,
contrast=list(c("sex_male_vs_female")),
pAdjustMethod="BH")
plotMA(res_contrast_sex_hdhorn)
It looks like many more genes are showing evidence of differential expression.
res_contrast_sex_hdhorn <- res_contrast_sex_hdhorn[order(res_contrast_sex_hdhorn$padj),] # Sorting genes based on "significance"... yuck.
head(res_contrast_sex_hdhorn)
## log2 fold change (MAP): sex_male_vs_female effect
## Wald test p-value: sex_male_vs_female effect
## DataFrame with 6 rows and 6 columns
## baseMean
## <numeric>
## c13309_g1_i1&381..3329@gi|642922737|ref|XP_008193304.1| 3321.6900
## c13895_g1_i1&40..600@gi|642912129|ref|XP_966308.2| 18941.0317
## c11799_g1_i3&177..1004@gi|91082327|ref|XP_974606.1| 2272.2644
## c9824_g1_i4&86..2518@gi|91092640|ref|XP_969145.1| 1274.6929
## c10543_g1_i3&395..1423@gi|642938091|ref|XP_008190378.1| 155.8273
## c13074_g6_i2&1705..2502@gi|189236266|ref|XP_974475.2| 220.3801
## log2FoldChange
## <numeric>
## c13309_g1_i1&381..3329@gi|642922737|ref|XP_008193304.1| 2.1615325
## c13895_g1_i1&40..600@gi|642912129|ref|XP_966308.2| 0.7937278
## c11799_g1_i3&177..1004@gi|91082327|ref|XP_974606.1| 0.9505307
## c9824_g1_i4&86..2518@gi|91092640|ref|XP_969145.1| -0.7479408
## c10543_g1_i3&395..1423@gi|642938091|ref|XP_008190378.1| -2.9141189
## c13074_g6_i2&1705..2502@gi|189236266|ref|XP_974475.2| -2.2872664
## lfcSE
## <numeric>
## c13309_g1_i1&381..3329@gi|642922737|ref|XP_008193304.1| 0.2084865
## c13895_g1_i1&40..600@gi|642912129|ref|XP_966308.2| 0.1010133
## c11799_g1_i3&177..1004@gi|91082327|ref|XP_974606.1| 0.1362995
## c9824_g1_i4&86..2518@gi|91092640|ref|XP_969145.1| 0.1095547
## c10543_g1_i3&395..1423@gi|642938091|ref|XP_008190378.1| 0.4291651
## c13074_g6_i2&1705..2502@gi|189236266|ref|XP_974475.2| 0.3413740
## stat
## <numeric>
## c13309_g1_i1&381..3329@gi|642922737|ref|XP_008193304.1| 10.367732
## c13895_g1_i1&40..600@gi|642912129|ref|XP_966308.2| 7.857658
## c11799_g1_i3&177..1004@gi|91082327|ref|XP_974606.1| 6.973841
## c9824_g1_i4&86..2518@gi|91092640|ref|XP_969145.1| -6.827097
## c10543_g1_i3&395..1423@gi|642938091|ref|XP_008190378.1| -6.790205
## c13074_g6_i2&1705..2502@gi|189236266|ref|XP_974475.2| -6.700177
## pvalue
## <numeric>
## c13309_g1_i1&381..3329@gi|642922737|ref|XP_008193304.1| 3.476790e-25
## c13895_g1_i1&40..600@gi|642912129|ref|XP_966308.2| 3.913815e-15
## c11799_g1_i3&177..1004@gi|91082327|ref|XP_974606.1| 3.084028e-12
## c9824_g1_i4&86..2518@gi|91092640|ref|XP_969145.1| 8.665025e-12
## c10543_g1_i3&395..1423@gi|642938091|ref|XP_008190378.1| 1.119740e-11
## c13074_g6_i2&1705..2502@gi|189236266|ref|XP_974475.2| 2.081678e-11
## padj
## <numeric>
## c13309_g1_i1&381..3329@gi|642922737|ref|XP_008193304.1| 1.751955e-21
## c13895_g1_i1&40..600@gi|642912129|ref|XP_966308.2| 9.860858e-12
## c11799_g1_i3&177..1004@gi|91082327|ref|XP_974606.1| 5.180139e-09
## c9824_g1_i4&86..2518@gi|91092640|ref|XP_969145.1| 1.091577e-08
## c10543_g1_i3&395..1423@gi|642938091|ref|XP_008190378.1| 1.128474e-08
## c13074_g6_i2&1705..2502@gi|189236266|ref|XP_974475.2| 1.748263e-08
How many are differentially regulated?
summary(res_contrast_sex_hdhorn, alpha= 0.05)
##
## out of 5076 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 298, 5.9%
## LFC < 0 (down) : 458, 9%
## outliers [1] : 37, 0.73%
## low counts [2] : 0, 0%
## (mean count < 0.1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
Specifying nominal alpha.. Let’s look at the histogram of expected p-values when there is nothing interesting going on.
length(res_contrast_sex_hdhorn$padj[res_contrast_sex_hdhorn$padj < 0.05])
## [1] 793
attr(DESeq_data, "modelMatrixType")
## [1] "standard"
Type of model matrix, expanded or treatment contrast. For more complex models this is very important to understand.
hd_horn <- factor(parse_names[,4])
hd_horn == "hdhorn" # Just setting up a logical statement to use to extract the columns and rows we need.
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE
subsetting for hd horn
tot_count_hd_horn <- tot_count_matrix[ , hd_horn == "hdhorn"]
experimental_design_hd_horn <- experimental_design[hd_horn == "hdhorn",]
#experimental_design_hd_horn$tissue <- droplevels(experimental_design_hd_horn$tissue)
#experimental_design_hd_horn$sample_names <- droplevels(experimental_design_hd_horn$sample_names)
Checking that nothing went amiss
colnames(tot_count_hd_horn) == experimental_design_hd_horn$sample_names
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE
DESeq_hdhorn <- DESeqDataSetFromMatrix(tot_count_hd_horn, experimental_design_hd_horn,
design = formula(~ sex))
DESeq_hdhorn$sex <- relevel(DESeq_hdhorn$sex, "male" )
DESeq_hdhorn <- DESeq(DESeq_hdhorn)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 93 genes
## -- DESeq argument 'minReplicatesForReplace' = 7
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing
res_hd_horn <- results(DESeq_hdhorn, alpha=0.05)
attr(DESeq_hdhorn , "modelMatrixType")
## [1] "expanded"
Let’s take a look
summary(res_hd_horn, alpha= 0.05)
##
## out of 5076 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 469, 9.2%
## LFC < 0 (down) : 387, 7.6%
## outliers [1] : 0, 0%
## low counts [2] : 0, 0%
## (mean count < 0.1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
plotMA(res_hd_horn, ylim=c(-3,3))
And the contrast
contrast_hd_horn_sex <- results(DESeq_hdhorn, contrast=c("sex", "male", "female"))
res_hd_horn <- res_hd_horn[order(res_hd_horn$padj),]
hd_horn_SSD <- res_hd_horn[res_hd_horn$padj < 0.05,]
hd_horn_SSD_df <- as.data.frame(hd_horn_SSD) # make data frame to make more usable.
hd_horn_SSD_df$FB.protein <- rownames(hd_horn_SSD_df)
hdhorn_dat <- data.frame(res_hd_horn@listData$baseMean, res_hd_horn@listData$log2FoldChange, res_hd_horn@listData$padj )
colnames(hdhorn_dat) <- c("baseMean", "log2FoldChange", "padj" )
Some expression mean plots
plot(y = log2(res_hd_horn$baseMean), x =(log2(res_hd_horn$baseMean) - res_hd_horn$log2FoldChange ),
ylab = "mean expression males", xlab = "mean expression females", xlim=c(-5, 20), ylim=c(-5, 20),
main = "head horn", pch = 20, col ="grey")
with(hdhorn_dat[hdhorn_dat$padj <0.05, ],
points(y = log2(baseMean),x =(log2(baseMean) - log2FoldChange ),
pch=20, col = "black" ))
abline(a=0, b=1 , col="blue")
plot(y = res_hd_horn$log2FoldChange, x =log2(res_hd_horn$baseMean) ,
ylab = "log2 fold change (M/F)", xlab = "mean expression males",
main = " MAplot sex differences in the head horn", pch = 20, col ="grey")
with(hdhorn_dat[hdhorn_dat$padj <0.05, ],
points(y = log2FoldChange,x = log2(baseMean),
pch=20, col = "black" ))
abline(a=0, b=0 , col="blue")
Notice something different about fitting the sex effect as a contrast in a larger model VS as its own model?